
data-inventory's Introduction

18F owns, generates, or manages several datasets at GSA. In support of the Open Data Policy, we want to ensure that they are appropriately included in the agency's data catalogs, so that agency staff, other agencies, and the public can find this information as easily as possible.

The metadata for these datasets is managed in this Google Sheet and is regularly submitted to GSA's data team for incorporation into the agency's enterprise data inventory and public data listing.

Regular Coworking Tasks

  • Consider additions to https://18f.gsa.gov/developer/
  • Check in on how we're doing with our API reporting.
  • Review Open Issues.
  • Request metadata changes from Cindy.
  • Review current metadata records for potential updates.
  • Look at data.gov to review how the entries resolve.
  • Think through recent releases - are there any new records that need creating?
  • Ping team in #general-talk to ask for any updates.
  • Look at upcoming releases and try to coordinate rollouts so that posting to the data listing is included.

data-inventory's People

Contributors

arowla, gbinal

data-inventory's Issues

Tock Timecards API

Current issues:

  • Needs POC
  • Should we include it in data.json, given that it's private?

CSV validator and continuous integration?

If this becomes bigger (or the data.csv is read on a more automated basis by GSA), we should consider adding a script for validating the CSV (csvkit could be a good starting point) and possibly a CI step that validates any CSV changes in pull requests.
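A minimal sketch of what such a check could look like, using only the Python standard library. The required column names below are assumptions, not the actual data.csv header, and csvkit's csvclean offers similar checks out of the box.

import csv
import sys

# Hypothetical required columns -- adjust to match the real data.csv header.
REQUIRED_COLUMNS = {"title", "description", "accessLevel", "publisher"}

def validate_csv(path):
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            errors.append("missing columns: %s" % sorted(missing))
        for lineno, row in enumerate(reader, start=2):
            if any(value is None for value in row.values()):
                errors.append("line %d: too few fields" % lineno)
    return errors

if __name__ == "__main__":
    problems = validate_csv(sys.argv[1])
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)

A CI job could run this against data.csv on every pull request and fail the check on a non-zero exit code.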

4/20/16 Edits to GSA's public data listing

Add "18F" as a keyword to:

  • FBOpen API
  • List of Government APIs
  • List of Government Developer Hubs

Update the publisher for the below entries from:

"publisher": {
"@type": "org:Organization",
"name": "General Services Administration"
},

to:

"publisher": {
"@type": "org:Organization",
"name": "OCSIT/18F",
"subOrganizationOf": {
"@type": "org:Organization",
"name": "General Services Administration"
}
},
  • List of Government Developer Hubs
  • List of Government APIs
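For reference, a rough sketch of how these edits could be applied to a local copy of the listing, assuming it follows the Project Open Data v1.1 layout with a top-level dataset array (the file path is hypothetical; GSA's data team makes the actual changes):

import json

# Hypothetical local copy of GSA's public data listing.
with open("data.json", encoding="utf-8") as f:
    catalog = json.load(f)

KEYWORD_TARGETS = {"FBOpen API", "List of Government APIs", "List of Government Developer Hubs"}
PUBLISHER_TARGETS = {"List of Government Developer Hubs", "List of Government APIs"}

for entry in catalog["dataset"]:
    title = entry.get("title", "")
    if title in KEYWORD_TARGETS and "18F" not in entry.get("keyword", []):
        # Add 18F as a keyword.
        entry.setdefault("keyword", []).append("18F")
    if title in PUBLISHER_TARGETS:
        # Nest OCSIT/18F under GSA as the publisher.
        entry["publisher"] = {
            "@type": "org:Organization",
            "name": "OCSIT/18F",
            "subOrganizationOf": {
                "@type": "org:Organization",
                "name": "General Services Administration"
            }
        }

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(catalog, f, indent=2, ensure_ascii=False)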

The following projects are split into individual entries instead of using a distribution:

  • Digital Analytics Program data

Give Each Data Source Its Own Issue

The initial inventory outlined in Issue #1 has been a good start, but as we move towards a more regular workflow, it probably makes sense to use GitHub Issues to track the status of adding new datasets to the data.json. I'm proposing the following process:

  1. An issue should be created for each new dataset/API associated with an entry. This means that if an application like Analytics has several datasets associated with it, there will be one issue per dataset. This might seem repetitive, but the distinction only matters in a few cases.
  2. We will track the status of a dataset's entry by using labels that are assigned/removed to reflect the current state of that dataset in the GSA's JSON. For more details on the states, see below.
  3. All of these issues should be tagged with a special dataset label to distinguish them from issues about the data-inventory as a whole.

We can think of a given dataset as going through a sequence of states, each represented by a label (names still need to be settled):

  • Assigned - the information for this dataset needs to be filled out
  • Ready For Review - the row for the dataset has been filled out in the spreadsheet for the Data Inventory team to review
  • Reviewed by 18F - the row looks good and can be submitted to the GSA data.json team
  • Submitted To GSA - the record has been submitted to the GSA data.json team
  • Accepted by GSA - the record was applied to the GSA's data.json
  • Rejected - the record should not be added to the data.json
  • Removed - the record was removed from the GSA's data.json and the spreadsheet
  • Revised - set when the application maintainer needs to edit an existing record; from here the record flows through Reviewed by 18F, Submitted To GSA, and Accepted by GSA as a first-time record would
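A rough sketch of the intended flow between these labels; the transitions are my reading of the list above, not a settled design:

# Proposed label states and the allowed transitions between them (assumptions, not final).
TRANSITIONS = {
    "Assigned": ["Ready For Review"],
    "Ready For Review": ["Reviewed by 18F", "Rejected"],
    "Reviewed by 18F": ["Submitted To GSA"],
    "Submitted To GSA": ["Accepted by GSA", "Rejected"],
    "Accepted by GSA": ["Revised", "Removed"],
    "Revised": ["Reviewed by 18F"],  # then on to Submitted To GSA, Accepted by GSA as above
    "Rejected": [],
    "Removed": [],
}

def can_move(current, target):
    """Return True if a dataset's label may move from `current` to `target`."""
    return target in TRANSITIONS.get(current, [])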

Maintainers should explicitly provide context when necessary after changing a dataset's status from one state to another. In particular, if a dataset record is rejected or revised, the reason should be provided.

In addition, I would like to add a few special format labels: API, CSV, XML, and JSON.

A few open questions:

  • It's a bit awkward to discuss general questions about an application if there are N separate dataset issues for it. Should we have a convention of a single parent Application issue that links to each of the specific Dataset issues?
  • Can we put the issue number for the dataset as an extra column in the Google Sheet/CSV?
  • Should we prefix these states with some sort of sequential prefix, like "DS0: ASSIGNED"?
  • Should the Revised label be eliminated in favor of just using Ready For Review with a comment noting that it's a revision to an existing dataset?

Datasets to add

Add notes here for any datasets that we should add to 18F's submission for GSA's data catalog.

Entry for 18F Blog RSS feed

Other RSS feeds are in the GSA data.json file (see below). We should add an entry for the 18F blog RSS feed.

{
  "@type": "dcat:Dataset",
  "title": "GobiernoUSA.gov Blog RSS feed",
  "description": "We help you find official U.S. government information and services in Spanish on the Internet.",
  "modified": "2016-01-07",
  "accessLevel": "public",
  "identifier": "GSA-2014-02-14-2",
  "dataQuality": true,
  "describedBy": "http://www.rssboard.org/rss-specification",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "rights": "N/A",
  "spatial": "National",
  "publisher": {
    "@type": "org:Organization",
    "name": "General Services Administration"
  },
  "contactPoint": {
    "@type": "vcard:Contact",
    "fn": "Russell G O'Neill",
    "hasEmail": "mailto:[email protected]"
  },
  "distribution": [
    {
      "@type": "dcat:Distribution",
      "mediaType": "text/html",
      "format": "text/html",
      "title": "GobiernoUSA.gov Blog RSS feed",
      "downloadURL": "http://blog.gobiernousa.gov/rss"
    }
  ],
  "keyword": [
    "Blog",
    "Consumer",
    "Services in Spanish",
    "government benefits",
    "government information",
    "government services",
    "news"
  ],
  "bureauCode": [
    "023:00"
  ],
  "programCode": [
    "023:014"
  ],
  "language": [
    "en-us"
  ],
  "theme": [
    "Other"
  ]
}

FBI Crime Data Explorer

The FBI Crime Data Explorer has an API that will go live in approximately March 2017. Given that this is a project for the DOJ, and the DOJ maintains its own data.json, this will possibly not be something we handle, but noting it here.

Needed information

  • URL
  • Data dictionary
  • POC name/email

Remove entry for MyUSA

It seems that MyUSA is being deprecated and projects that use it for authentication are being migrated away. When MyUSA is no longer in service, we should remove the listing from the data.json file.

Revision to MyUSA API entry

The MyUSA API was already within the GSA's data.json file, but I emailed to make four changes:

  1. Change the POC to be Eric Maland
  2. Change the distribution's Format field from "None" to "API"
  3. Change name to MyUSA API
  4. Change description to "MyUSA API for authentication and task assignment"

Unresolved questions about the differences between the Open Metadata Schema and GSA's CSV input

I was reading over the Project Open Data metadata schema again, and I noted a few things that might be issues in the CSV format used by GSA's team for specifying new entries for the data.json file. I wanted to outline them here so we can follow up with them.

  • The modified field should be an interval if a dataset is continually updated. Do they use accrualPeriodicity instead of the spreadsheet's Modified column when a dataset's periodicity is not irregular? This makes sense for datasets updated every minute, but what about a dataset that is updated once a month? (A rough example follows this list.)
  • The schema specifies that contactPoint is just "a container for two fields that together make up the contact information for the dataset." The vCard format assumes this is a person, but does it really need to be? I just want to confirm this.
  • The schema supports a landingPage field, which is a human-friendly landing page for the data and not necessarily a data dictionary. Do we need to support that?
  • It looks like the describedBy and describedByType fields are meant to be used for human-readable descriptions at the top level and machine-readable ones within distribution records.
  • The schema definition allows users to specify one or more references for documentation that is not a data dictionary. Do we have a need to use this anywhere?
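To make these questions concrete, here is a sketch of how those fields might look on a hypothetical entry, based on my reading of the Project Open Data v1.1 schema; every value below is illustrative, not a real record:

# Hypothetical data.json fragment illustrating the fields discussed above.
example_entry = {
    "title": "Example 18F dataset",                    # placeholder
    "modified": "R/P1M",                               # repeating interval: continually updated, monthly
    "accrualPeriodicity": "R/P1M",                     # ISO 8601 repeating duration
    "contactPoint": {
        "@type": "vcard:Contact",
        "fn": "Project Team",                          # does this have to be a person?
        "hasEmail": "mailto:dataset-poc@example.gov",  # placeholder address
    },
    "landingPage": "https://example.18f.gov/dataset",  # human-friendly page
    "describedBy": "https://example.18f.gov/data-dictionary",
    "references": ["https://example.18f.gov/docs"],    # other documentation
}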

It also looks like we have some issues with how we have represented APIs in our sample CSV. I will follow up with a related issue.

Remove entry for FBOpen API?

It looks like the decision has been made to discontinue FBOpen, but the API is still in operation while the details are worked out. If the API is discontinued, we should remove the entry for the API from the data.json listing.

Should we be leveraging the about.yml format instead?

I'm wondering if we should figure out a better way to use the about.yml format to get the information we need to generate and keep this data up to date. There are several options we could explore.

Populate/track some of our data.json data from the existing about.yml fields:

  1. The POC email is in the current about.yml; combined with a lookup against the team API, this would cover a lot of our needs for finding/tracking the POC for content.
  2. The links section maps to references. We could also look for specifically tagged links for API documentation.

We could also consider tweaking the about.yml format some more to allow apps to indicate dataset/API distributions within the about.yml file itself. This would allow us to distribute the work of entering and keeping this data up to date across all the projects:

  1. Perhaps add datasets and apis trees to the YML with the applicable subfields (although in some cases we could also infer them)
  2. A process for generating the GSA's requested CSV (or some other format suitable for them) from the about.yml files
  3. An internal site for tracking which APIs/datasets have been posted to the GSA already (turning this workflow into an app instead)
  4. A way of noticing changes between scans and flagging us so we can notify GSA about them

It's a bit more work in the short run, but it has the advantage of decentralizing the work of collecting and maintaining these records, so in the long run it might be easier.
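As a rough sketch of the generation step (option 2 above), assuming hypothetical field names, since the exact about.yml keys and GSA's CSV columns would need to be confirmed:

import csv
import glob

import yaml  # PyYAML

# Columns are placeholders for whatever GSA's CSV template actually requires.
FIELDNAMES = ["title", "description", "contact_email", "references"]

rows = []
for path in glob.glob("projects/*/about.yml"):  # hypothetical checkout layout
    with open(path, encoding="utf-8") as f:
        about = yaml.safe_load(f) or {}
    contact = (about.get("contact") or [{}])[0]
    rows.append({
        "title": about.get("full_name", ""),
        "description": about.get("description", ""),
        "contact_email": contact.get("email", ""),  # assumed structure
        "references": "; ".join(link.get("url", "") for link in about.get("links") or []),
    })

with open("gsa-data-inventory.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

The same scan could diff its output against the previous run to flag changes that need to be reported to GSA (option 4).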

What do you think @gbinal and @mbland?

2-11-16 Edits to GSA's public data listing

For "List of Government APIs" (GSA - 139011), the accessURL should be changed from https://raw.githubusercontent.com/18F/API-All-the-X/gh-pages/_data/individual_apis.yml to https://pages.18f.gov/API-All-the-X/data/individual_apis.json. Format should be changed from API to JSON.

For "List of Government Developer Hubs" (GSA - 139012), the accessURL should be changed from https://raw.githubusercontent.com/18F/API-All-the-X/gh-pages/_data/developer_hubs.yml to https://pages.18f.gov/API-All-the-X/data/developer_hubs.json. Format should be changed from html to JSON.
