
data-inventory's Introduction

18F owns, generates, or manages several datasets at GSA. In support of the Open Data Policy, we want to ensure that they are appropriately included in the agency's data catalogs, so that agency staff, other agencies, and the public can find this information as easily as possible.

The metadata for these datasets is managed in this Google Sheet and is regularly submitted to GSA's data team for incorporation into the agency's enterprise data inventory and public data listing.

Regular Coworking Tasks

  • Consider additions to https://18f.gsa.gov/developer/
  • Check in on how we're doing with our API reporting.
  • Review Open Issues.
  • Request metadata changes from Cindy.
  • Review current metadata records for potential updates.
  • Look at data.gov to review how the entries resolve.
  • Think through recent releases - are there any new records that need creating?
  • Ping team in #general-talk to ask for any updates.
  • Look at upcoming releases and try to coordinate rollouts so that posting to the data listing is included.

data-inventory's People

Contributors

arowla, gbinal

data-inventory's Issues

Tock Timecards API

Current issues:

  • Needs POC
  • Should we include it in data.json, given that it's private?

CSV validator and continuous integration?

If this becomes bigger (or the data.csv is read on a more automated basis by GSA), we should consider adding a script for validating the CSV (csvkit could be a good starting point) and possibly a CI step that validates any CSV changes in pull requests.
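A minimal sketch of what such a check could look like, using only the Python standard library. The required column names below are assumptions, not the actual data.csv header, and csvkit's csvclean offers similar checks out of the box.

import csv
import sys

# Hypothetical required columns -- adjust to match the real data.csv header.
REQUIRED_COLUMNS = {"title", "description", "accessLevel", "publisher"}

def validate_csv(path):
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            errors.append("missing columns: %s" % sorted(missing))
        for lineno, row in enumerate(reader, start=2):
            if any(value is None for value in row.values()):
                errors.append("line %d: too few fields" % lineno)
    return errors

if __name__ == "__main__":
    problems = validate_csv(sys.argv[1])
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)

A CI job could run this against data.csv on every pull request and fail the check on a non-zero exit code.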

4/20/16 Edits to GSA's public data listing

Add "18F" as a keyword to:

  • FBOpen API
  • List of Government APIs
  • List of Government Developer Hubs

Update the publisher for the below entries from:

"publisher": {
"@type": "org:Organization",
"name": "General Services Administration"
},

to:

"publisher": {
"@type": "org:Organization",
"name": "OCSIT/18F",
"subOrganizationOf": {
"@type": "org:Organization",
"name": "General Services Administration"
}
},
  • List of Government Developer Hubs
  • List of Government APIs
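For reference, a rough sketch of how these edits could be applied to a local copy of the listing, assuming it follows the Project Open Data v1.1 layout with a top-level dataset array (the file path is hypothetical; GSA's data team makes the actual changes):

import json

# Hypothetical local copy of GSA's public data listing.
with open("data.json", encoding="utf-8") as f:
    catalog = json.load(f)

KEYWORD_TARGETS = {"FBOpen API", "List of Government APIs", "List of Government Developer Hubs"}
PUBLISHER_TARGETS = {"List of Government Developer Hubs", "List of Government APIs"}

for entry in catalog["dataset"]:
    title = entry.get("title", "")
    if title in KEYWORD_TARGETS and "18F" not in entry.get("keyword", []):
        # Add 18F as a keyword.
        entry.setdefault("keyword", []).append("18F")
    if title in PUBLISHER_TARGETS:
        # Nest OCSIT/18F under GSA as the publisher.
        entry["publisher"] = {
            "@type": "org:Organization",
            "name": "OCSIT/18F",
            "subOrganizationOf": {
                "@type": "org:Organization",
                "name": "General Services Administration"
            }
        }

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(catalog, f, indent=2, ensure_ascii=False)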

The following projects are split into individual entries instead of using a distribution:

  • Digital Analytics Program data

Give Each Data Source Its Own Issue

The initial inventory outlined in Issue #1 has been a good start, but as we move towards a more regular workflow, it probably makes sense to use GitHub Issues to track the status of adding new datasets to the data.json. I'm proposing the following process:

  1. An issue should be created for each new dataset/API associated with an entry. This means that if an application like Analytics has several datasets associated with it, there will be one issue per dataset. This might seem repetitive, but the distinction only matters in a few cases.
  2. We will track the status of a dataset's entry by using labels that are assigned/removed to reflect the current state of that dataset in the GSA's JSON. For more details on the states, see below.
  3. All of these issues should be tagged with a special dataset label to distinguish them from issues about the data-inventory as a whole.

We can think of a given dataset as going through a sequence of states, each represented by a label (names still need to be settled):

  • Assigned - the information for this dataset needs to be filled out
  • Ready For Review - the row for the dataset has been filled out in the spreadsheet for the Data Inventory team to review
  • Reviewed by 18F - the row looks good and can be submitted to the GSA data.json team
  • Submitted To GSA - the record has been submitted to the GSA data.json team
  • Accepted by GSA - the record was applied to the GSA's data.json
  • Rejected - the record should not be added to the data.json
  • Removed - the record was removed from the GSA's data.json and the spreadsheet
  • Revised - set when the application maintainer needs to edit an existing record; from here the record flows through Reviewed by 18F, Submitted To GSA, and Accepted by GSA as a first-time record would
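A rough sketch of the intended flow between these labels; the transitions are my reading of the list above, not a settled design:

# Proposed label states and the allowed transitions between them (assumptions, not final).
TRANSITIONS = {
    "Assigned": ["Ready For Review"],
    "Ready For Review": ["Reviewed by 18F", "Rejected"],
    "Reviewed by 18F": ["Submitted To GSA"],
    "Submitted To GSA": ["Accepted by GSA", "Rejected"],
    "Accepted by GSA": ["Revised", "Removed"],
    "Revised": ["Reviewed by 18F"],  # then on to Submitted To GSA, Accepted by GSA as above
    "Rejected": [],
    "Removed": [],
}

def can_move(current, target):
    """Return True if a dataset's label may move from `current` to `target`."""
    return target in TRANSITIONS.get(current, [])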

Maintainers should explicitly provide context when necessary after changing a dataset's status from one state to another. In particular, if a dataset record is rejected or revised, the reason should be provided.

In addition, I would like to add a few special format labels: API, CSV, XML, and JSON.

A few open questions:

  • It's a bit awkward to discuss general questions about an application if there are N separate dataset issues for it. Should we have a convention of a single parent Application issue that links to each of the specific Dataset issues?
  • Can we put the issue number for the dataset as an extra column in the Google Sheet/CSV?
  • Should we prefix these states with some sort of sequential prefix, like "DS0: ASSIGNED"?
  • Should the Revised label be eliminated in favor of just using Ready For Review with a comment noting that it's a revision to an existing dataset?

Datasets to add

Add notes here for any datasets that we should add to 18F's submission for GSA's data catalog.

Entry for 18F Blog RSS feed

Other RSS feeds are in the GSA data.json file (see below). We should add an entry for the 18F blog RSS feed.

{
  "@type": "dcat:Dataset",
  "title": "GobiernoUSA.gov Blog RSS feed",
  "description": "We help you find official U.S. government information and services in Spanish on the Internet.",
  "modified": "2016-01-07",
  "accessLevel": "public",
  "identifier": "GSA-2014-02-14-2",
  "dataQuality": true,
  "describedBy": "http://www.rssboard.org/rss-specification",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/",
  "rights": "N/A",
  "spatial": "National",
  "publisher": {
    "@type": "org:Organization",
    "name": "General Services Administration"
  },
  "contactPoint": {
    "@type": "vcard:Contact",
    "fn": "Russell G O'Neill",
    "hasEmail": "mailto:[email protected]"
  },
  "distribution": [
    {
      "@type": "dcat:Distribution",
      "mediaType": "text/html",
      "format": "text/html",
      "title": "GobiernoUSA.gov Blog RSS feed",
      "downloadURL": "http://blog.gobiernousa.gov/rss"
    }
  ],
  "keyword": [
    "Blog",
    "Consumer",
    "Services in Spanish",
    "government benefits",
    "government information",
    "government services",
    "news"
  ],
  "bureauCode": [
    "023:00"
  ],
  "programCode": [
    "023:014"
  ],
  "language": [
    "en-us"
  ],
  "theme": [
    "Other"
  ]
}

FBI Crime Data Explorer

The FBI Crime Data Explorer has an API that will go live in approximately March 2017. Given that this is a project for the DOJ, and the DOJ maintains its own data.json, this will possibly not be something we handle, but noting it here.

Needed information

  • URL
  • Data dictionary
  • POC name/email

Remove entry for MyUSA

It seems that MyUSA is being deprecated and projects that use it for authentication are being migrated away. When MyUSA is no longer in service, we should remove the listing from the data.json file.

Revision to MyUSA API entry

The MyUSA API was already within the GSA's data.json file, but I emailed to make four changes:

  1. Change the POC to be Eric Maland
  2. Change the distribution's Format field from "None" to "API"
  3. Change name to MyUSA API
  4. Change description to "MyUSA API for authentication and task assignment"

Unresolved questions about the differences between the Open Metadata Schema and GSA's CSV input

I was reading over the Project Open Data metadata schema again, and I noted a few things that might be issues in the CSV format used by GSA's team for specifying new entries for the data.json file. I wanted to outline them here so we can follow up with them.

  • The modified field should be an interval if a dataset is continually updated. Do they use accrualPeriodicity instead of the spreadsheet's Modified column when a dataset's periodicity is not irregular? This makes sense for datasets updated every minute, but what about a dataset that is updated once a month? (A rough example follows this list.)
  • The schema specifies that contactPoint is just "a container for two fields that together make up the contact information for the dataset." The vCard format assumes this is a person, but does it really need to be? I just want to confirm this.
  • The schema supports a landingPage field, which is a human-friendly landing page for the data and not necessarily a data dictionary. Do we need to support that?
  • It looks like the describedBy and describedByType fields are meant to be used for human-readable descriptions at the top level and machine-readable ones within distribution records.
  • The schema definition allows users to specify one or more references for documentation that is not a data dictionary. Do we have a need to use this anywhere?
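To make these questions concrete, here is a sketch of how those fields might look on a hypothetical entry, based on my reading of the Project Open Data v1.1 schema; every value below is illustrative, not a real record:

# Hypothetical data.json fragment illustrating the fields discussed above.
example_entry = {
    "title": "Example 18F dataset",                    # placeholder
    "modified": "R/P1M",                               # repeating interval: continually updated, monthly
    "accrualPeriodicity": "R/P1M",                     # ISO 8601 repeating duration
    "contactPoint": {
        "@type": "vcard:Contact",
        "fn": "Project Team",                          # does this have to be a person?
        "hasEmail": "mailto:dataset-poc@example.gov",  # placeholder address
    },
    "landingPage": "https://example.18f.gov/dataset",  # human-friendly page
    "describedBy": "https://example.18f.gov/data-dictionary",
    "references": ["https://example.18f.gov/docs"],    # other documentation
}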

It also looks like we have some issues with how we have represented APIs in our sample CSV. I will follow up with a related issue.

Remove entry for FBOpen API?

It looks like the decision has been made to discontinue FBOpen, but the API is still in operation while the details are worked out. If the API is discontinued, we should remove the entry for the API from the data.json listing.

Should we be leveraging the about.yml format instead?

I'm wondering if we should figure out a better way to use the about.yml format to get the information we need to generate and keep this data up to date. There are several options we could explore.

Populate/track some of our data.json data from the existing about.yml fields:

  1. The POC email is in the current about.yml; combined with a lookup against the team API, this would cover a lot of our needs for finding/tracking the POC for content.
  2. The links section maps to references. We could also look for specifically tagged links for API documentation.

We could also consider tweaking the about.yml format some more to allow apps to indicate dataset/API distributions within the about.yml file itself. This would allow us to distribute the work of entering and keeping this data up to date across all the projects:

  1. Perhaps add datasets and apis trees to the YML with the applicable subfields (although in some cases we could also infer them)
  2. A process for generating the GSA's requested CSV (or some other format suitable for them) from the about.yml files
  3. An internal site for tracking which APIs/datasets have been posted to the GSA already (turning this workflow into an app instead)
  4. A way of noticing changes between scans and flagging us so we can notify GSA about them

It's a bit more work in the short run, but it has the advantage of decentralizing the work of collecting and maintaining these records, so in the long run it might be easier.
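As a rough sketch of the generation step (option 2 above), assuming hypothetical field names, since the exact about.yml keys and GSA's CSV columns would need to be confirmed:

import csv
import glob

import yaml  # PyYAML

# Columns are placeholders for whatever GSA's CSV template actually requires.
FIELDNAMES = ["title", "description", "contact_email", "references"]

rows = []
for path in glob.glob("projects/*/about.yml"):  # hypothetical checkout layout
    with open(path, encoding="utf-8") as f:
        about = yaml.safe_load(f) or {}
    contact = (about.get("contact") or [{}])[0]
    rows.append({
        "title": about.get("full_name", ""),
        "description": about.get("description", ""),
        "contact_email": contact.get("email", ""),  # assumed structure
        "references": "; ".join(link.get("url", "") for link in about.get("links") or []),
    })

with open("gsa-data-inventory.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

The same scan could diff its output against the previous run to flag changes that need to be reported to GSA (option 4).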

What do you think @gbinal and @mbland?

2-11-16 Edits to GSA's public data listing

For "List of Government APIs" (GSA - 139011), the accessURL should be changed from https://raw.githubusercontent.com/18F/API-All-the-X/gh-pages/_data/individual_apis.yml to https://pages.18f.gov/API-All-the-X/data/individual_apis.json. Format should be changed from API to JSON.

For "List of Government Developer Hubs" (GSA - 139012), the accessURL should be changed from https://raw.githubusercontent.com/18F/API-All-the-X/gh-pages/_data/developer_hubs.yml to https://pages.18f.gov/API-All-the-X/data/developer_hubs.json. Format should be changed from html to JSON.
